2023-03-03

Aims for today

  • Complex data structures (matrices, lists and data frames)
  • Functions in R
  • Reading data from files

Recap of Day 1

Questions, anybody?

Functions in R: a brief introduction

samples   <- c(1, 10, 23, 42, 13)
samples_n <- length(samples)
  • Both c and length are functions. They take some arguments (often many of them) and return a single object: a vector, a matrix or something else.

  • You can always assign the result of the function to a variable.

  • Sometimes functions return NULL, which is R for “nothing”, but which is still something you can use or assign.
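A small illustration of the points above (a sketch; the variable name nothing is made up for this example): calling c() with no arguments returns NULL, and that result can be assigned and inspected like any other object.

```r
samples   <- c(1, 10, 23, 42, 13)
samples_n <- length(samples)   # the result of length() is stored in a variable

nothing <- c()                 # c() without arguments returns NULL
is.null(nothing)               # TRUE: NULL can be assigned and tested
length(nothing)                # 0
```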

Exercise 2.1: More on vectors (and functions)

There is a lake in a garden. Every day, the water lilies cover twice as much area as the previous day. On the first day, the water lilies cover 1/100th of the area of the lake.

  • What is the formula to calculate the area covered by the water lilies on day n? (pen and paper, not in RStudio!)
  • assign days 1 … 10 to variable x (use c() or :). Now assign the fraction of the area covered by water lilies on day n to variable y
  • plot x vs y (use the simple plot() function)
  • When will half of the area be covered by water lilies? Use abline(h=.5) (what does it do?) to show a graphical solution
  • What is the fraction on day 3? Hint: what does y[3] do?
  • Speaking of which, what does y[4:5] do? (just try it!)

Exercise 2.2: More on plotting

  • Make the plot yourself.
  • use the ‘col’ parameter of plot() and a color name (e.g. “red”) to change the color of the line (plot(..., col="red"))
  • what if each day the lilies cover a fraction of the area that is 1.5 times the fraction covered on the previous day?
  • Use the lines(x, y) function to put a second line on the plot
  • For more experienced users: Create an R Markdown document.

Logical vectors

In R, there are two special values: TRUE and FALSE. They can be used to create logical vectors.

sel     <- c(TRUE, TRUE, TRUE, TRUE, FALSE)
sel
!sel

Comparison operators (>, <, <=, >=, ==, !=) produce logical vectors:

samples <- c(1, 1, 2, 5, 7)
samples > 2
which(samples == 7)
which(samples != 1)

Logical vectors can be used to access elements:

persons <- c("Aphrodite", "Bacchus", "Circe", "Demeter", "Euripides")
sel     <- c(TRUE, TRUE, TRUE, TRUE, FALSE)
persons[sel]

# we can abbreviate the TRUE and FALSE to T and F (avoid)
greek <- persons[ c(T, F, T, T, T) ]
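Comparisons and logical indexing are often combined in a single step. A minimal sketch, reusing the samples vector from the comparison example (the names big, big_too and not_one are made up for this illustration):

```r
samples <- c(1, 1, 2, 5, 7)

# the comparison produces a logical vector, which then selects elements
big <- samples[samples > 2]             # 5 7

# which() gives the positions instead; indexing by them selects the same elements
big_too <- samples[which(samples > 2)]  # 5 7

# negation with ! flips the selection
not_one <- samples[!(samples == 1)]     # 2 5 7
```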

Exercise 2.3: more on vectors and functions, introducing NA

Create a vector as follows:

samples <- c(1, 10, NA, 15)

NA stands for not available (e.g., missing data)

  • try adding a number to that vector
  • what does length(samples) return?
  • what does mean(samples) return? Why is that?
  • use the named parameter na.rm=TRUE for the mean() function. Look up help (?mean) to see how it can be used. What happens now?
  • what does the is.na() function return when applied to samples?
  • how do you find NA values? Use is.na and which
  • How do you select only those samples that are not NA?

Extra (if you are bored): Creating a function

The reason we are showing how to create a function is to show you that it is simple, and also because it will help you understand what functions are.

#' Function name
#' Function description
some_name <- function(param1, param2=2) {

## code comment
# <your code goes in here>

}
  • If you have to write something more than twice, you might want to turn it into a function!
  • R is a functional language:
    • almost everything is a function
    • functions usually take something and return a value; they are not supposed to use or modify anything that has not been passed to them explicitly (sometimes hard to avoid)
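To make the template above concrete, here is a minimal sketch of a complete function. The name scale_by and its parameters are invented for this example; it only illustrates a default parameter value and the fact that the input is not modified.

```r
#' Scale a numeric vector
#' Multiplies every element of x by a factor (default 2)
scale_by <- function(x, factor = 2) {
  # the last evaluated expression is the return value
  x * factor
}

samples <- c(1, 10, 23)
doubled <- scale_by(samples)               # uses the default: 2 20 46
tenfold <- scale_by(samples, factor = 10)  # 10 100 230
samples                                    # unchanged: 1 10 23
```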

Complex data structures

Main data representations in R

  • vectors (character, numeric, integer, logical, factor)
  • matrices and arrays
  • lists
  • data frames

Data matrices

Much like vectors, matrices can only hold one data type (e.g. only numeric or only character or only logical etc.).

m <- matrix(1:18, ncol=3, nrow=6)
# compare with
m <- matrix(1:18, ncol=3, nrow=6, byrow=TRUE)
dim(m)
ncol(m)
nrow(m)
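The effect of byrow and of the one-type rule is easier to see on a small matrix. A sketch with made-up variable names (m_col, m_row, mixed):

```r
# by default, a matrix is filled column by column
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_col[1, ]          # 1 3 5

# with byrow=TRUE it is filled row by row
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_row[1, ]          # 1 2 3

# mixing numbers and strings turns everything into character
mixed <- matrix(c(1, 2, "a", "b"), nrow = 2)
class(mixed[1, 1])  # "character"
```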

Accessing columns, rows and elements

matrix[row, column]

So, for example:

m[1, ] # vector which is the first row
m[, 2] # vector which is the second column
m[3, 1] # first element of the third row

One more thing about vectors: named indices

Elements of a vector can be accessed not only using numbers (indices) or logical vectors. You can also assign names to the elements and use those names:

person <- c("January", "Weiner", 134)
names(person) <- c("FirstName", "LastName", "Age")
person["FirstName"]
person["Age"]
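Note that this vector mixes strings and a number. Because a vector holds only one data type, the number is silently converted to a string. A short sketch of what that means in practice (the variable age is introduced only for this example):

```r
person <- c("January", "Weiner", 134)
names(person) <- c("FirstName", "LastName", "Age")

class(person)                     # "character": 134 was stored as "134"
person["Age"]                     # a named character vector of length 1

age <- as.numeric(person["Age"])  # convert back before doing arithmetic
age + 1                           # 135
```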

Row and column names

We can name rows and columns of a matrix and use the names to access the rows and columns:

colnames(m) <- letters[1:ncol(m)]
rownames(m) <- LETTERS[1:nrow(m)]

m["A", "b"]  # one "cell"
m["B", ]     # one row
m[   , "b"]  # one column

Remember

  • If you select a single column or a single row, you will get a vector
  • If you select more than one row or column, you will get a (smaller) matrix
  • If you select more rows or columns than are present, you will get a “subscript out of bounds” error
  • REMEMBER: Vectors and matrices always have only one data type (string, integer etc.)
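These rules can be checked directly. A minimal sketch, assuming a 6x3 matrix like the one created earlier; the variable msg is introduced only for this example, and tryCatch() is used so that the error does not stop the script:

```r
m <- matrix(1:18, ncol = 3, nrow = 6)

is.matrix(m[, 2])     # FALSE: a single column is a plain vector
is.matrix(m[1:2, ])   # TRUE: two rows are still a matrix
dim(m[1:2, ])         # 2 3

# asking for row 7 of a 6-row matrix is an error; capture its message
msg <- tryCatch(m[7, ], error = function(e) conditionMessage(e))
msg
```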

Demonstration

Lists

  • Lists hold all types of information
  • Lists are really, really cool
  • Lists have elements. An element of a list can have any type, including another list.
  • You create lists using the list() function
person <- list(name="Weiner", 
               Age=NA, 
               given="January")

Accessing lists

To access an element of a list, you need to use double brackets [[

person[["name"]]

There is a shortcut:

person$name

If you use single brackets [, you will get a piece of the “clothesline” (think of a list as a clothesline with its elements hanging on it), that is, you will produce a smaller list.

person["name"]
class(person["name"])

Lists

Caveats:

  • You access elements of a list using [[, not [
  • Lists may have names (set with names()), but don’t have to
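The difference between [[ and [, and the fact that lists can be nested and unnamed, can be seen in a few lines. A sketch reusing the person list from above (the list nested is made up for this example):

```r
person <- list(name = "Weiner", Age = NA, given = "January")

class(person[["name"]])   # "character": the element itself
class(person["name"])     # "list": a smaller list of length 1

# lists can contain other lists and do not need names
nested <- list(person, c(1, 2, 3))
nested[[2]]               # the numeric vector 1 2 3
nested[[1]]$given         # "January"
```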

Data frames

Data frames are a bit like matrices, but every column can store a different type of data. In this, they are more like lists (which they in fact are).

names <- c("January", "Manuela", "Bill")
lastn <- c("Weiner", "Benary", "Gates")
age   <- c(1001, NA, 65)

d <- data.frame(names=names, last_names=lastn, age=age)
class(d)
class(d[,1])
class(d[,3])

Accessing elements in data frames

You can access the data frame elements much like the elements of a matrix.

However, since data frames are lists, the list operator ($) also works:

d$names # same as d[,1] or d[, "names"]
d$last_names    # the column name in d is last_names, not lastn
d$last_names[1]

However, note that when you select a row, you will get a data frame, not a vector. This is because each of the columns can be of a different type, and vectors can hold only one type of data.
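A short sketch that verifies these statements on the example data frame (rebuilt here so that the block runs on its own):

```r
d <- data.frame(names      = c("January", "Manuela", "Bill"),
                last_names = c("Weiner", "Benary", "Gates"),
                age        = c(1001, NA, 65))

is.list(d)                    # TRUE: a data frame is a list of columns
class(d[1, ])                 # "data.frame": a row keeps all column types
class(d[, 3])                 # "numeric": a single column is a vector
d[1, "age"]                   # 1001
identical(d$age, d[["age"]])  # TRUE: $ is a shortcut for [[
```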

Matrices vs data frames

Caveats:

  • in R versions before 4.0.0, data.frame() turns strings into factors by default (more on that later), which may have really disastrous consequences; use stringsAsFactors=FALSE there
  • a single typo in a numeric column (e.g. a letter O instead of a zero) turns the whole column into strings
  • factors are dangerous to work with, use them cautiously

Gory details: matrices are a basic data type. Data frames are lists.
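Two of these caveats are easy to reproduce. The following sketch uses made-up values; it shows how one typo changes the type of a whole vector, and why converting a factor directly to numbers is risky:

```r
# a single typo ("4O" with a letter O) makes the whole vector character
age <- c(23, "4O", 65)
class(age)                    # "character"

# converting a factor directly yields the internal level codes,
# not the numbers that the labels look like
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1
as.numeric(as.character(f))   # 10 20 10
```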

Data frames vs tibbles

Caveats:

  • tibbles are the data frames from tidyverse
  • Almost everything you can do to a data frame, you can do to a tibble as well
  • read_* functions return a tibble
  • tibbles do not support row names
  • If you select a single row in a data frame, you get a smaller data frame. If you select a single column with [ , ], you get a vector.
  • In a tibble, [ , ] always gives you a smaller tibble.

Exercise 2.4

  • Create a 5x3 matrix with random numbers. Use matrix and rnorm.
  • Turn the matrix into a data frame. Use as.data.frame for that.
  • Add column and row names.
  • Add a column. Each value in the column should be “A” (a string). Use the rep function for that.
  • Add a column with five numbers from 0 to 1. Use the seq function for that.

Using rep

rep() is used to replicate vectors.

rep(c("A", "B"), 5) 
# result:
#  [1] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"

rep(c("A", "B"), each=5)
# result
# [1] "A" "A" "A" "A" "A" "B" "B" "B" "B" "B"

Reading and writing data

Reading data

Main data types you will encounter:

Data type                     Function                   Package  Notes
Columns separated by spaces   read_table()               readr    One or more spaces separate each column
TSV / tab-separated values    read_tsv()                 readr    Delimiter is tab (\t)
CSV / comma-separated values  read_csv()                 readr    Delimiter is a comma
Any delimiter                 read_delim()               readr    Customizable
XLS (old Excel)               read_xls(), read_excel()   readxl   Just don’t use it.
XLSX (new Excel)              read_xlsx(), read_excel()  readxl   Use the sheet argument to choose a sheet (the first one is read by default). Note: returns a tibble, not a data frame!

Note: there are also “base R” functions read.table, read.csv and read.delim (the latter for tab-separated files); there is no function for reading XLS[X] files in base R. The tidyverse functions above are preferable.
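A minimal sketch of reading a CSV file with both approaches. It assumes the readr package is installed; the file is a small temporary file written only for this example, and the names tmp, d_tidy and d_base are made up.

```r
library(readr)

# write a tiny CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("sample,value", "A,1.5", "B,2.5"), tmp)

d_tidy <- read_csv(tmp, show_col_types = FALSE)  # readr: returns a tibble
d_base <- read.csv(tmp)                          # base R: returns a data frame

class(d_base)   # "data.frame"
d_tidy$value    # 1.5 2.5
```

The show_col_types = FALSE argument only silences the column type message that read_csv() prints.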

Exercise 2.5

Read and inspect the following files:

  • TB_ORD_Gambia_Sutherland_biochemicals.csv
  • iris.csv
  • meta_data_botched.xlsx
  1. Which functions would you use?
  2. What kind of issues can you detect?
  3. How would you solve these issues?

The function readxl_example("deaths.xls") returns a file name. Read this file. How can you omit the lines at the top and at the bottom of the file? (hint: ?read_excel). How can you force the date columns to be interpreted as dates and not numbers?

Tibbles / readxl

tibbles belong to the tidyverse. They are nice to work with and very useful, but we can stick to data frames for now. Therefore, do

mydataframe <- as.data.frame(read_xlsx("file.xlsx"))

One crucial difference between a tibble and a data frame is that tibble[ , 1] returns a tibble, while dataframe[ , 1] returns a vector. The second crucial difference is that a tibble does not support row names (on purpose!).
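The first difference can be seen in a short sketch. It assumes the tibble package is installed; the objects tb and df are made up for this example.

```r
library(tibble)

tb <- tibble(x = 1:3, y = c("a", "b", "c"))
df <- as.data.frame(tb)

class(tb[, 1])   # still a tibble: "tbl_df" "tbl" "data.frame"
class(df[, 1])   # "integer": a plain vector
tb[[1]]          # use [[ or $ to get a tibble column as a vector
```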